Goto

Collaborating Authors

 nepali language


NepaliGPT: A Generative Language Model for the Nepali Language

Pudasaini, Shushanta, Shakya, Aman, Shrestha, Siddhartha, Bhatta, Sahil, Thapa, Sunil, Palikhe, Sushmita

arXiv.org Artificial Intelligence

After the release of ChatGPT, Large Language Models (LLMs) have gained huge popularity in recent days and thousands of variants of LLMs have been released. However, there is no generative language model for the Nepali language, due to which other downstream tasks, including fine-tuning, have not been explored yet. To fill this research gap in the Nepali NLP space, this research proposes \textit{NepaliGPT}, a generative large language model tailored specifically for the Nepali language. This research introduces an advanced corpus for the Nepali language collected from several sources, called the Devanagari Corpus. Likewise, the research introduces the first NepaliGPT benchmark dataset comprised of 4,296 question-answer pairs in the Nepali language. The proposed LLM NepaliGPT achieves the following metrics in text generation: Perplexity of 26.32245, ROUGE-1 score of 0.2604, causal coherence of 81.25\%, and causal consistency of 85.41\%.


Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali

Duwal, Sharad, Prasai, Suraj, Manandhar, Suresh

arXiv.org Artificial Intelligence

Continual learning has emerged as an important research direction due to the infeasibility of retraining large language models (LLMs) from scratch in the event of new data availability. Of great interest is the domain-adaptive pre-training (DAPT) paradigm, which focuses on continually training a pre-trained language model to adapt it to a domain it was not originally trained on. In this work, we evaluate the feasibility of DAPT in a low-resource setting, namely the Nepali language. We use synthetic data to continue training Llama 3 8B to adapt it to the Nepali language in a 4-bit QLoRA setting. We evaluate the adapted model on its performance, forgetting, and knowledge acquisition. We compare the base model and the final model on their Nepali generation abilities, their performance on popular benchmarks, and run case-studies to probe their linguistic knowledge in Nepali. We see some unsurprising forgetting in the final model, but also surprisingly find that increasing the number of shots during evaluation yields better percent increases in the final model (as high as 19.29% increase) compared to the base model (4.98%), suggesting latent retention. We also explore layer-head self-attention heatmaps to establish dependency resolution abilities of the final model in Nepali.


Fine-Tuning Small Embeddings for Elevated Performance

Silwal, Biraj

arXiv.org Artificial Intelligence

Contextual Embeddings have yielded state-of-the-art results in various natural language processing tasks. However, these embeddings are constrained by models requiring large amounts of data and huge computing power. This is an issue for low-resource languages like Nepali as the amount of data available over the internet is not always sufficient for the models. This work has taken an incomplete BERT model with six attention heads pretrained on Nepali language and finetuned it on previously unseen data. The obtained results from intrinsic and extrinsic evaluations have been compared to the results drawn from the original model baseline and a complete BERT model pretrained on Nepali language as the oracle. The results demonstrate that even though the oracle is better on average, finetuning the small embeddings drastically improves results compared to the original baseline.


Development of Pre-Trained Transformer-based Models for the Nepali Language

Thapa, Prajwal, Nyachhyon, Jinu, Sharma, Mridul, Bal, Bal Krishna

arXiv.org Artificial Intelligence

Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali Language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on Nep-gLUE benchmark, scoring 95.60 and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.


Whisper Finetuning on Nepali Language

Rijal, Sanjay, Adhikari, Shital, Dahal, Manish, Awale, Manish, Ojha, Vaghawan

arXiv.org Artificial Intelligence

Despite the growing advancements in Automatic Speech Recognition (ASR) models, the development of robust models for underrepresented languages, such as Nepali, remains a challenge. This research focuses on making an exhaustive and generalized dataset followed by fine-tuning OpenAI's Whisper models of different sizes to improve transcription (speech-to-text) accuracy for the Nepali language. We leverage publicly available ASR datasets and self-recorded custom datasets with a diverse range of accents, dialects, and speaking styles further enriched through augmentation. Our experimental results demonstrate that fine-tuning Whisper models on our curated custom dataset substantially reduces the Word Error Rate (WER) across all model sizes attributed to larger data variations in terms of speaker's age, gender, and sentiment, acoustic environment, dialect, denser audio segments (15-30 seconds) that are more compatible with Whisper's input, and manual curation of audios and transcriptions. Notably, our approach outperforms Whisper's baseline models trained on Fleur's dataset, achieving WER reductions of up to 36.2% on the small and 23.8% on medium models. Furthermore, we show that data augmentation plays a significant role in enhancing model robustness. Our approach underlines the importance of dataset quality, variation, and augmentation in the adaptation of state-of-the-art models to underrepresented languages for developing accurate ASR systems.


Abstractive Summarization of Low resourced Nepali language using Multilingual Transformers

Dhakal, Prakash, Baral, Daya Sagar

arXiv.org Artificial Intelligence

Automatic text summarization in Nepali language is an unexplored area in natural language processing (NLP). Although considerable research has been dedicated to extractive summarization, the area of abstractive summarization, especially for low-resource languages such as Nepali, remains largely unexplored. This study explores the use of multilingual transformer models, specifically mBART and mT5, for generating headlines for Nepali news articles through abstractive summarization. The research addresses key challenges associated with summarizing texts in Nepali by first creating a summarization dataset through web scraping from various Nepali news portals. These multilingual models were then fine-tuned using different strategies. The performance of the fine-tuned models were then assessed using ROUGE scores and human evaluation to ensure the generated summaries were coherent and conveyed the original meaning. During the human evaluation, the participants were asked to select the best summary among those generated by the models, based on criteria such as relevance, fluency, conciseness, informativeness, factual accuracy, and coverage. During the evaluation with ROUGE scores, the 4-bit quantized mBART with LoRA model was found to be effective in generating better Nepali news headlines in comparison to other models and also it was selected 34.05% of the time during the human evaluation, outperforming all other fine-tuned models created for Nepali News headline generation.


Can Perplexity Predict Fine-Tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali

Luitel, Nishant, Bekoju, Nirajan, Sah, Anand Kumar, Shakya, Subarna

arXiv.org Artificial Intelligence

Recent language models use subwording mechanisms to handle Out-of-Vocabulary(OOV) words seen during test time and, their generation capacity is generally measured using perplexity, an intrinsic metric. It is known that increasing the subword granularity results in a decrease of perplexity value. However, the study of how subwording affects the understanding capacity of language models has been very few and only limited to a handful of languages. To reduce this gap we used 6 different tokenization schemes to pretrain relatively small language models in Nepali and used the representations learned to finetune on several downstream tasks. Although byte-level BPE algorithm has been used in recent models like GPT, RoBERTa we show that on average they are sub-optimal in comparison to algorithms such as SentencePiece in finetuning performances for Nepali. Additionally, similar recent studies have focused on the Bert-based language model. We, however, pretrain and finetune sequential transformer-based language models.


Optical Text Recognition in Nepali and Bengali: A Transformer-based Approach

Hasan, S M Rakib, Dhakal, Aakar, Mehedi, Md Humaion Kabir, Rasel, Annajiat Alim

arXiv.org Artificial Intelligence

Efforts on the research and development of OCR systems for Low-Resource Languages are relatively new. Low-resource languages have little training data available for training Machine Translation systems or other systems. Even though a vast amount of text has been digitized and made available on the internet the text is still in PDF and Image format, which are not instantly accessible. This paper discusses text recognition for two scripts: Bengali and Nepali; there are about 300 and 40 million Bengali and Nepali speakers respectively. In this study, using encoder-decoder transformers, a model was developed, and its efficacy was assessed using a collection of optical text images, both handwritten and printed. The results signify that the suggested technique corresponds with current approaches and achieves high precision in recognizing text in Bengali and Nepali. This study can pave the way for the advanced and accessible study of linguistics in South East Asia.


A Comprehensive Study of the Current State-of-the-Art in Nepali Automatic Speech Recognition Systems

Ghimire, Rupak Raj, Bal, Bal Krishna, Poudyal, Prakash

arXiv.org Artificial Intelligence

In this paper, we examine the research conducted in the field of Nepali Automatic Speech Recognition (ASR). The primary objective of this survey is to conduct a comprehensive review of the works on Nepali Automatic Speech Recognition Systems completed to date, explore the different datasets used, examine the technology utilized, and take account of the obstacles encountered in implementing the Nepali ASR system. In tandem with the global trends of ever-increasing research on speech recognition based research, the number of Nepalese ASR-related projects are also growing. Nevertheless, the investigation of language and acoustic models of the Nepali language has not received adequate attention compared to languages that possess ample resources. In this context, we provide a framework as well as directions for future investigations.


COVID-19-related Nepali Tweets Classification in a Low Resource Setting

Adhikari, Rabin, Thapaliya, Safal, Basnet, Nirajan, Poudel, Samip, Shakya, Aman, Khanal, Bishesh

arXiv.org Artificial Intelligence

Billions of people across the globe have been using social media platforms in their local languages to voice their opinions about the various topics related to the COVID-19 pandemic. Several organizations, including the World Health Organization, have developed automated social media analysis tools that classify COVID-19-related tweets into various topics. However, these tools that help combat the pandemic are limited to very few languages, making several countries unable to take their benefit. While multi-lingual or low-resource language-specific tools are being developed, they still need to expand their coverage, such as for the Nepali language. In this paper, we identify the eight most common COVID-19 discussion topics among the Twitter community using the Nepali language, set up an online platform to automatically gather Nepali tweets containing the COVID-19-related keywords, classify the tweets into the eight topics, and visualize the results across the period in a web-based dashboard. We compare the performance of two state-of-the-art multi-lingual language models for Nepali tweet classification, one generic (mBERT) and the other Nepali language family-specific model (MuRIL). Our results show that the models' relative performance depends on the data size, with MuRIL doing better for a larger dataset. The annotated data, models, and the web-based dashboard are open-sourced at https://github.com/naamiinepal/covid-tweet-classification.